Attention Heads of Large Language Models: A Survey
Abstract
The paper surveys the latest research on the interpretability of attention heads in large language models (LLMs). It proposes a four-stage framework for understanding the reasoning mechanisms of LLMs, inspired by human cognitive processes: Knowledge Recalling, In-Context Identification, Latent Reasoning, and Expression Preparation. Using this framework, the paper categorizes and analyzes the functions of various attention heads discovered in recent studies. It also summarizes the experimental methodologies employed to uncover these special attention heads, dividing them into Modeling-Free and Modeling-Required approaches. Additionally, the paper discusses relevant evaluation tasks and benchmarks, as well as related research on feed-forward networks and machine psychology. Finally, it outlines the limitations of current research and suggests potential future directions.
Q&A
[01] Knowledge Recalling (KR)
1. What is the role of attention heads in the Knowledge Recalling (KR) stage of LLMs?
- Certain attention heads function as associative memories, gradually storing knowledge during the model's training phase and later retrieving content related to the current problem from the model's parametric knowledge.
- In specific task scenarios, LLMs may use a Constant Head to distribute attention scores evenly across all options, or a Single Letter Head to assign higher attention to one option, thereby capturing the potential answers (see the sketch after this list).
- LLMs can also exhibit a negative bias in Binary Decision Tasks due to a Negative Head that focuses more on negative expressions learned from prior knowledge.
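To make the Constant Head and Single Letter Head behaviors concrete, here is a minimal numpy sketch (our own illustration, not code from the survey) that summarizes how much of a head's final-token attention falls on each option letter; the function name and the toy attention weights are hypothetical.

```python
import numpy as np

def option_attention_profile(attn_row, option_positions):
    """Summarize how one head's final-token attention is spread over the
    positions of the option letters (e.g. "A", "B", "C", "D").

    attn_row: attention weights from the last token to every position,
              shape (seq_len,), summing to 1 after softmax.
    option_positions: token indices of the option letters in the prompt.
    """
    option_attn = np.array([attn_row[p] for p in option_positions])
    share = option_attn / (option_attn.sum() + 1e-9)   # renormalize over the options only
    entropy = -(share * np.log(share + 1e-9)).sum()    # high ~ Constant-Head-like,
    return {"share": share.round(3),                   # low ~ Single-Letter-Head-like
            "entropy": float(entropy),
            "top_option": int(np.argmax(share))}

# Toy row over a 10-token prompt with option letters at positions 2, 4, 6, 8.
row = np.array([0.02, 0.03, 0.20, 0.05, 0.22, 0.04, 0.21, 0.04, 0.18, 0.01])
print(option_attention_profile(row, option_positions=[2, 4, 6, 8]))
```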
[02] In-Context Identification (ICI)
1. How do attention heads operate during the In-Context Identification (ICI) stage?
- Attention heads in the ICI stage use their QK matrices to focus on and identify overall structural, syntactic, and semantic information within the context.
- For structural information, Previous Head, Rare Words Head, and Duplicate Token Head attend to positional relationships, unique tokens, and repeated content, respectively (see the sketch after this list).
- For syntactic information, the Syntactic Head can identify and label nominal subjects, direct objects, and other grammatical elements, while the Subword Merge Head focuses on merging subwords into complete words.
- For semantic information, the Context Head extracts task-relevant information, the Content Gatherer Head moves tokens related to the correct answer, and the Sentiment Summarizer summarizes sentiment-expressing words and phrases.
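As a concrete illustration of these QK-pattern diagnostics, the sketch below scores two of the structural behaviors (attention to the immediately preceding position and to repeated tokens) directly from a head's attention matrix. It is a minimal sketch under our own assumptions; the score definitions and toy values are illustrative rather than the survey's exact formulations.

```python
import numpy as np

def previous_token_score(attn):
    """Average attention each position pays to its immediate predecessor
    (high for a Previous-Head-like pattern).
    attn: (seq_len, seq_len) causal attention pattern of one head."""
    idx = np.arange(1, attn.shape[0])
    return float(attn[idx, idx - 1].mean())

def duplicate_token_score(attn, token_ids):
    """Average attention paid to earlier occurrences of the same token
    (high for a Duplicate-Token-Head-like pattern)."""
    scores = []
    for i, tok in enumerate(token_ids):
        earlier = [j for j in range(i) if token_ids[j] == tok]
        if earlier:
            scores.append(attn[i, earlier].sum())
    return float(np.mean(scores)) if scores else 0.0

# Toy pattern: each position mostly attends to the token right before it.
n = 6
attn = np.zeros((n, n))
attn[0, 0] = 1.0
for i in range(1, n):
    attn[i, i - 1] = 0.8
    attn[i, :i + 1] += 0.2 / (i + 1)   # small uniform remainder
    attn[i] /= attn[i].sum()
print(previous_token_score(attn))                        # close to 1.0
print(duplicate_token_score(attn, [5, 7, 5, 9, 7, 5]))   # nonzero where tokens repeat
```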
[03] Latent Reasoning (LR)
1. What are the key functions of attention heads in the Latent Reasoning (LR) stage?
- Attention heads in the LR stage perform implicit reasoning based on the information gathered during the KR and ICI stages, and write the reasoning results back into the residual stream.
- For In-context Learning, Task Recognition (TR) heads like Summary Reader can infer task labels from summarized information, while Task Learning (TL) heads like Induction Head can discover patterns and complete fill-in-the-blank reasoning (see the induction-score sketch after this list).
- Truthfulness Head and Accuracy Head are correlated with the truthfulness and accuracy of the model's answers, while Vulnerable Heads are overly sensitive to certain inputs, leading to incorrect results.
- Task-specific heads like Correct Letter Head, Iteration Head, and Successor Head specialize in matching options, iterative reasoning, and arithmetic operations, respectively.
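The Induction Head pattern ([A][B] ... [A] → attend to [B]) can be quantified with a simple attention-based score. The following self-contained sketch is our own illustration; the score definition and the idealized toy pattern are assumptions, not the survey's exact procedure.

```python
import numpy as np

def induction_score(attn, token_ids):
    """How much attention each token pays to the position right after the
    previous occurrence of the same token -- the classic induction pattern
    [A][B] ... [A] -> look at [B].
    attn: (seq_len, seq_len) causal attention pattern of one head."""
    scores = []
    for i, tok in enumerate(token_ids):
        targets = [j + 1 for j in range(i) if token_ids[j] == tok and j + 1 < i]
        if targets:
            scores.append(sum(attn[i, t] for t in targets))
    return float(np.mean(scores)) if scores else 0.0

# Idealized head on the repeated sequence 7 3 9 7 3 9: each repeated token
# attends to the token that followed its first occurrence.
token_ids = [7, 3, 9, 7, 3, 9]
attn = np.zeros((6, 6))
attn[3, 1] = attn[4, 2] = attn[5, 3] = 1.0
attn[0, 0] = attn[1, 0] = attn[2, 0] = 1.0   # early tokens just attend to position 0
print(induction_score(attn, token_ids))       # 1.0 for this idealized pattern
```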
[04] Expression Preparation (EP)
1. How do attention heads contribute to the Expression Preparation (EP) stage?
- EP heads aggregate information from the ICI and LR stages, such as the Mixed Head, which linearly combines various types of information.
- Amplification Head and Correct Head amplify the signal of the correct choice in multiple-choice tasks, ensuring it has the highest probability (see the logit-attribution sketch after this list).
- Coherence Head and Faithfulness Head help align the model's output with the user's instructions and faithfully reflect its internal reasoning.
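A common way to probe amplification-style behavior is direct logit attribution: project the head's write into the residual stream through the unembedding matrix and compare the correct option's logit with its rivals. The sketch below uses random toy matrices, and names such as `head_logit_contribution` and `W_U` are ours; it illustrates the idea rather than the survey's exact procedure.

```python
import numpy as np

def head_logit_contribution(head_output, W_U, answer_token, rival_tokens):
    """Direct-logit-attribution style check: how much does one head's
    residual-stream write boost the correct answer relative to rival options?

    head_output: (d_model,) vector the head adds to the final residual stream.
    W_U:         (d_model, vocab) unembedding matrix.
    """
    logits = head_output @ W_U
    return {"answer_logit": float(logits[answer_token]),
            "margin_over_rivals": float(logits[answer_token] - logits[rival_tokens].max())}

# Toy numbers: a head whose write aligns with the correct option's unembedding
# column yields a large positive margin (amplification-like behavior).
rng = np.random.default_rng(0)
d_model, vocab = 64, 100
W_U = rng.normal(size=(d_model, vocab))
answer_token, rivals = 42, np.array([10, 11, 12])
head_output = 3.0 * W_U[:, answer_token]
print(head_logit_contribution(head_output, W_U, answer_token, rivals))
```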
[05] Experimental Methods
1. What are the two main categories of experimental methods used to explore attention head mechanisms? The paper categorizes the experimental methods into two main types:
- Modeling-Free methods:
- Modification-Based: Directional Addition and Directional Subtraction
- Replacement-Based: Zero Ablation, Mean Ablation, and Naive Activation Patching (see the sketch after this list)
- Modeling-Required methods:
- Training-Required: Probing and training simplified Transformer models
- Training-Free: Defining scores that reflect specific phenomena, such as Retrieval Score and Negative Attention Score
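To illustrate the Modeling-Free, replacement-based methods above, here is a minimal sketch with hypothetical function names; in a real experiment the replaced activations are written back into the forward pass via hooks, and the resulting change in a task metric is measured.

```python
import numpy as np

def zero_ablate(head_act):
    """Zero Ablation: remove the head's contribution entirely."""
    return np.zeros_like(head_act)

def mean_ablate(head_act, reference_acts):
    """Mean Ablation: replace the activation with its dataset mean, so the head
    still writes something but carries no input-specific information."""
    return reference_acts.mean(axis=0)

def patch(clean_act, corrupted_act):
    """Naive Activation Patching: splice the clean-run activation into a run on
    the corrupted prompt to test whether this head restores the behavior."""
    return clean_act

def patching_effect(metric_clean, metric_corrupted, metric_patched):
    """Normalized causal effect of patching one head: ~1 means the head alone
    restores the clean behavior, ~0 means it does not matter for this metric."""
    return (metric_patched - metric_corrupted) / (metric_clean - metric_corrupted + 1e-9)

# Toy numbers: the metric is the correct answer's logit difference per run.
acts = np.random.default_rng(1).normal(size=(32, 16))   # (n_examples, d_head)
print(mean_ablate(acts[0], acts).shape)                  # (16,)
print(round(patching_effect(4.0, 0.5, 3.2), 2))          # ~0.77
```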
[06] Evaluation
1. What are the two main types of evaluation used in attention head interpretability research? The paper identifies two types of evaluation:
- Mechanism Exploration Evaluation:
- Datasets that simplify problem difficulty and focus on evaluating the model's knowledge reasoning and knowledge recalling capabilities, such as ToyMovieReview and ToyMoodStory.
- Common Evaluation:
- Benchmarks that assess the model's overall capabilities in areas like knowledge reasoning, sentiment analysis, long context retrieval, and text comprehension.
[07] Limitations and Future Directions
1. What are the key limitations of current research on attention head interpretability?
- The application scenarios explored are relatively simple and lack generalizability across different tasks.
- Most research focuses on the mechanisms of individual heads, with limited understanding of the collaborative relationships among multiple heads.
- The conclusions of many studies lack mathematical proofs, relying primarily on experimental validation of hypotheses.
2. What are some potential future research directions?
- Exploring attention head mechanisms in more complex tasks, such as open-ended question answering and math problems.
- Investigating the robustness of attention head mechanisms against different prompts.
- Developing new experimental methods to verify the indivisibility and universal applicability of identified mechanisms.
- Building a comprehensive interpretability framework that encompasses the independent and collaborative functioning of attention heads and other model components.
- Integrating insights from machine psychology to construct an internal mechanism framework for LLMs from an anthropomorphic perspective.